Abstract. What is the relationship between the complexity of a learner and
the randomness of his mistakes? This question was posed in [7], where it was shown
that the more complex the learner, the higher the possibility that his mistakes
deviate from a truly random sequence. In the current paper we report on an
empirical investigation of this problem. We investigate two characteristics of
randomness, the stochastic and algorithmic complexity of the binary sequence
of mistakes. A learner with a Markov model of order $k$ is trained on a finite
binary sequence produced by a Markov source of order $k^*$ and is tested on a
different random sequence. As a measure of the learner's complexity we define a
quantity called the sysRatio, denoted by $\rho$, which is the ratio between the
compressed and uncompressed lengths of the binary string whose $i$-th bit represents
the maximum a posteriori decision made at state $i$ of the learner's model. The
quantity $\rho$ is a measure of information density. The main result of the paper
shows that this ratio is crucial in answering the question posed above. The
result indicates that there is a critical threshold $\rho^*$ such that when $\rho < \rho^*$
the sequence of mistakes possesses the following features: (1) low divergence
$\Delta$ from a random sequence, (2) low variance in algorithmic complexity. When
$\rho > \rho^*$, the characteristics of the mistake sequence change sharply towards a
high $\Delta$ and high variance in algorithmic complexity.

1. Overview 

In computer science, the notion of computational complexity serves as a measure 
of how difficult it is to compute a solution for a given problem. Computations take
time, and complexity here means the rate at which the time needed to solve the problem
grows with the problem size. Another related kind of complexity measure (studied in theoretical computer science)
is the so-called algorithmic (or Kolmogorov) complexity which measures how long 
a computer program (on some generic computational machine) needs to be in order 
that it produces a complete description of an object. Interestingly, the theory says 
that if we consider as an object a system that can process input information (avail- 
able as a binary sequence of high entropy) and which produces another sequence 
as an output then the amount of randomness in the output sequence is inversely 
proportional to the algorithmic complexity of the system. 

This has been traditionally studied in the context of algorithmic randomness
(see [1] and references within), and until recently it was unknown whether
such a relationship between complexity and randomness exists for more general
systems, for instance, those governed by physical laws. In [5] the complexity of
a general static system (for instance, a physical solid) is modeled algorithmically,
i.e., by its description length. Using this model it is proposed that the stability of
a static system (from the physical perspective) is related to its level of algorithmic
complexity. This is explained by the relationship between the complexity of a
system and its ability to 'distort' the randomness in its environment. The first
proof of this concept appeared in a recent paper [8], where it was shown that this
inverse relationship between system complexity and randomness also exists in a
physical system. The particular system investigated consisted of a one-dimensional
vibrating solid beam to which a random sequence of external input forces is applied.

The current paper is yet another proof of concept of the model of [5]. We proceed
along the lines of [8], but instead of considering a physical system (the static solid
with an input force sequence) we consider a decision system and study its influence
on a random binary data sequence on which prediction decisions are made. The
decision system is based on the maximum a posteriori probability decision where 
probabilities are defined by a statistical parametric model which is estimated from 
data. The learner of this model is a computer program that trains from a given 
random data sequence and then produces a decision rule by which it is able to 
predict (or decide) the value of the next bit in future (yet unseen) random binary 
sequences. 

While this paper is in the realm of machine learning, we are not proposing a new
algorithm, nor are we interested in the performance of the learner. Rather, our
interest is in viewing a learning (and decision) system from the perspective of
static system complexity and its influence on random inputs [5].

2. Introduction 

Let $X^{(n)} = X_1, \ldots, X_n$ be a sequence of binary random variables drawn
according to some unknown joint probability distribution $P(X^{(n)})$. Consider the problem
of learning to predict the next bit in a binary sequence drawn according to $P$. For
training, the learner is given a finite sequence $x^{(m)}$ of bits $x_t \in \{0, 1\}$, $1 \le t \le m$,
drawn according to $P$ and estimates a model $M$ that can be used to predict the
next bit of a partially observed sequence. After training, the learner is tested on
another sequence $x^{(n)}$ drawn according to the same unknown distribution $P$. Using
$M$ he produces the bit $y_t$ as a prediction for $x_t$, $1 \le t \le n$. Denote by $\xi^{(n)}$ the
corresponding binary sequence of mistakes, where $\xi_t = 1$ if $y_t \ne x_t$ and $\xi_t = 0$ otherwise.
In [7] the following question was posed: how random is $\xi^{(n)}$?

It is clear that the sequence of mistakes should be random since the test
sequence $x^{(n)}$ is random. It may also be that, because the learner is using a model
of a finite structure (or a finite description length), it somehow introduces
dependencies and causes $\xi^{(n)}$ to be less random than $x^{(n)}$. And yet, by another
intuition, perhaps the fact that the learner is of finite complexity limits its ability
to 'deform' (or distort) the randomness of $x^{(n)}$? These are all valid initial guesses
that relate to this main question. We note that our basis for saying that $M$ has
a finite structure stems from it being an element of some regular hypothesis class,
for instance, one having a finite VC-dimension, as is often the case in a learning setting
(see for instance structural risk minimization in [10]). In the current paper, we are
not interested in the learner's performance (as modeled for instance by Valiant's
PAC framework [9], [6]) but instead we take a black-box view of a learner and ask
how much influence he has on the stochastic properties of the errors. We view
the learner as an entity that 'interferes' with the randomness that is inherent in the
sequence to be predicted and, through his predictions, creates a sequence of mistakes
that has a different stochastic character. This view, in a broader sense, is taken in
[5] and is shown (empirically) in [8] to explain how static structures may 'deform'
random external forces.

The question raised above was answered in [7] for a particular learning setting
where the teacher uses a probability distribution $P$ based on a Markov model with a
certain complexity. The learner has access to a hypothesis class of Boolean decision
rules that are based on Markov models. Hence, learning amounts to the estimation
of parameters of a finite-order Markov model (see for instance [3", T]). The answer 
shows theoretically that the random characteristics of the subsequence of mistakes
corresponding to the 0-predictions of a learner change in accordance with the
complexity of the learner's decision rule. The more complex the rule,
the higher the possibility of 'distortion' of randomness, i.e., the farther the subsequence is
from being truly random.

In the current paper we take an experimental approach to answering the above
question. As in [7], we focus on a Markov source and a Markov learner whose orders
may differ. In the next section we describe the setup.

3. Experimental setup

The learning problem consists of predicting the next bit in a given sequence
generated by a Markov chain (model) $M^*$ of order $k^*$. There are $2^{k^*}$ states in the
model, each represented by a word of $k^*$ bits. During a learning problem, the source's
model is fixed. A learner, unaware of the source's model, has a Markov model of
order $k$. We denote by $p(1|i)$ the probability of transiting from state $i$, whose binary
$k$-word is $b_i = [b_i(1), \ldots, b_i(k)]$, to the state whose word is $[b_i(2), \ldots, b_i(k), 1]$. Given
a random sequence of length $m$ generated by the source, the learner estimates its
own model's parameters $p(1|i)$ by $\hat{p}(1|i)$, $1 \le i \le 2^k$, which is the frequency of the
event "$b_i$ is followed by a 1" in the training sequence. We denote by $M$ the learnt
model with parameters $\hat{p}(1|i)$, $1 \le i \le 2^k$. We denote by $p^*(1|i)$ the transition
probability from state $i$ of the source model, $1 \le i \le 2^{k^*}$.
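The source and the frequency estimates described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the experiments: the function names `next_state`, `generate` and `estimate` are hypothetical, states are indexed from 0 instead of 1, the initial state is chosen arbitrarily, and words that never occur in the training sequence are assigned 0.5 (the text does not say how they are handled).

```python
import random

def next_state(state, bit, order):
    """Shift `bit` into a state stored as an integer of `order` bits (oldest bit first)."""
    return ((state << 1) | bit) & ((1 << order) - 1)

def generate(p_one, order, length, rng):
    """Draw `length` bits from a Markov source where p_one[i] = P(next bit is 1 | state i)."""
    state = rng.randrange(1 << order)  # arbitrary initial state (assumption)
    bits = []
    for _ in range(length):
        bit = 1 if rng.random() < p_one[state] else 0
        bits.append(bit)
        state = next_state(state, bit, order)
    return bits

def estimate(bits, k):
    """Return (p_hat, visits): frequency of a 1 after each k-bit word, and visits per word."""
    ones = [0] * (1 << k)
    visits = [0] * (1 << k)
    state = 0
    for t, bit in enumerate(bits):
        if t >= k:  # the first k bits only fill the initial word
            visits[state] += 1
            ones[state] += bit
        state = next_state(state, bit, k)
    # Unvisited words get 0.5 (assumption; the text does not specify this case).
    p_hat = [ones[i] / visits[i] if visits[i] else 0.5 for i in range(1 << k)]
    return p_hat, visits

rng = random.Random(0)
train = generate([0.7, 0.7, 0.3, 0.3], order=2, length=1000, rng=rng)
p_hat, visits = estimate(train, k=2)
```

Here `visits` plays the role of the sample size vector $v$ that appears later in the analysis of the decision vector.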

A simulation run is characterized by the parameters $k$ and $m$. It consists of a
training phase and a testing phase. In the training phase we show the learner a binary
sequence of length $m$ and he estimates the transition probabilities. In the testing
phase we show the learner another random sequence (generated by the same source)
of length $n$ and test the learner's predictions on it. For each bit in the test sequence
we record whether the learner has made a mistake. When a mistake occurs we
indicate this by a 1 and when there is no mistake we write a 0. The resulting
sequence of length $n$ is the generalization mistake sequence $\xi^{(n)}$. We denote by $\xi_0$
the binary subsequence of $\xi^{(n)}$ that consists of its entries at the positions
where the learner predicted a 0.
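The testing phase can be illustrated with the following Python sketch. The names `map_decisions` and `mistake_sequences` are hypothetical, and the sketch assumes that no prediction is made for the first $k$ test bits, since the learner has no complete $k$-bit history there; the text does not state how the start of the test sequence is treated.

```python
def map_decisions(p_hat):
    """Predict 1 at a state when its estimated probability of a 1 exceeds 1/2, else 0."""
    return [1 if p > 0.5 else 0 for p in p_hat]

def mistake_sequences(test_bits, decisions, k):
    """Return (xi, xi0): all mistake indicators, and those at positions where 0 was predicted."""
    mask = (1 << k) - 1
    state = 0
    xi, xi0 = [], []
    for t, bit in enumerate(test_bits):
        if t >= k:  # a prediction needs a full k-bit history (assumption)
            y = decisions[state]
            mistake = int(y != bit)
            xi.append(mistake)
            if y == 0:
                xi0.append(mistake)
        state = ((state << 1) | bit) & mask
    return xi, xi0

decisions = map_decisions([0.8, 0.1])  # order-1 learner: predict 1 after a 0, and 0 after a 1
xi, xi0 = mistake_sequences([0, 1, 1, 0, 0], decisions, k=1)
print(xi, xi0)  # [0, 1, 0, 1] [1, 0]
```

Note that the subsequence keeps both the 1s and the 0s recorded at the 0-prediction positions, so its length equals the number of times the learner predicted a 0.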

For a fixed $k$ denote by $N_{k,m}$ the number of runs with a learner of order $k$ and
training sample of size $m$. The experimental setup consists of $N_{k,m} = 10$ runs with
$1 \le k \le 10$, $m \in \{100, 200, \ldots, 10000\}$, for a total of $100 \cdot 10 \cdot N_{k,m} = 10000$ runs.
The testing sequence is of length $n = 1000$. Each run results in a file called system
which contains a binary vector $d$ whose $i$-th bit represents the maximum a posteriori
decision made at state $i$ of the learner's model, i.e.,

\[
d_i = \begin{cases} 1 & \text{if } \hat{p}(1|i) > 1/2, \\ 0 & \text{otherwise,} \end{cases} \tag{3.1}
\]

for $1 \le i \le 2^k$. Let us denote $\alpha_i = P(\hat{p}(1|i) > 1/2)$; thus the $d_i$ are Bernoulli random
variables with parameters $\alpha_i$, $1 \le i \le 2^k$. The learner's system is its decision rule
at every possible state.

Another file generated is the errorTO file, which contains the mistake subsequence
$\xi_0$. At the end of each run we measure the length of the system file and its
compressed length, where compression is obtained via the Gzip algorithm (a variant of
[11]), and compute the sysRatio (denoted by $\rho$), which is the ratio of the compressed
to the uncompressed length of the system file. Note that $\rho$ is a measure of
information density since it captures the number of bits of useful information (useful for
describing the system) per bit of representation (in the uncompressed
file).
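As a rough illustration of how such a ratio can be computed, the sketch below uses Python's standard `gzip` module. The function name `sys_ratio` is hypothetical, and storing one ASCII character per decision is an assumption, since the exact layout of the system file is not given.

```python
import gzip
import random

def sys_ratio(decisions):
    """Compressed length divided by uncompressed length of the decision vector."""
    # One ASCII byte per decision (assumption about the file layout).
    raw = "".join(str(d) for d in decisions).encode("ascii")
    return len(gzip.compress(raw)) / len(raw)

rng = random.Random(0)
k = 10
regular = [0] * (1 << k)                               # a highly regular decision vector
irregular = [rng.randrange(2) for _ in range(1 << k)]  # coin-flip decisions
print(sys_ratio(regular), sys_ratio(irregular))
```

A regular vector yields a much smaller ratio than a coin-flip vector of the same length. For very short vectors the fixed header and trailer of the gzip format dominate the compressed length, so under this layout the ratio can exceed 1 for small $k$.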

We do similarly for the mistake subsequence $\xi_0$, obtaining the length of the
compressed file that contains $\xi_0$ (henceforth referred to as the estimated algorithmic
complexity of $\xi_0$, since it is an approximation of the Kolmogorov complexity of
$\xi_0$, see [8]). We measure the KL-divergence $\Delta_0$ between the probability distribution
$P(w|p)$ of binary words $w$ of length 4 and the empirical probability distribution
$P_m(w)$ as measured from the mistake subsequence $\xi_0$. Note, $P(w|p)$ is defined
according to the Bernoulli model with parameter $p$, that is, $P(w|p) = p^i (1 - p)^{4-i}$
for a word $w$ with $i$ ones, where $p$ is the frequency of ones in the subsequence $\xi_0$.
The distribution $P_m(w)$ equals the frequency of a word $w$ in $\xi_0$. Hence $\Delta_0$ reflects
by how much $\xi_0$ deviates from being random according to a Bernoulli sequence.
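The divergence computation can be sketched as follows. Two choices here are assumptions that the text leaves open: words are read with a sliding window of length 4 (overlapping words), and the divergence is taken from the empirical distribution to the Bernoulli model, i.e. $\sum_w P_m(w) \log_2 (P_m(w) / P(w|p))$. The function name `divergence` is hypothetical.

```python
import random
from collections import Counter
from math import log2

def divergence(bits, word_len=4):
    """KL divergence (in bits) of the empirical word distribution from a Bernoulli model."""
    n_words = len(bits) - word_len + 1
    if n_words <= 0:
        raise ValueError("sequence is shorter than the word length")
    p = sum(bits) / len(bits)  # frequency of ones in the sequence
    # Sliding-window word counts (assumption: overlapping words).
    counts = Counter(tuple(bits[t:t + word_len]) for t in range(n_words))
    delta = 0.0
    for word, c in counts.items():
        q = c / n_words                              # empirical probability of the word
        i = sum(word)                                # number of ones in the word
        model = p ** i * (1 - p) ** (word_len - i)   # Bernoulli probability of the word
        delta += q * log2(q / model)
    return delta

rng = random.Random(0)
coin = [rng.randrange(2) for _ in range(1000)]
print(divergence(coin), divergence([0, 1] * 500))
```

A coin-flip sequence gives a value close to zero, whereas a strictly alternating sequence, which uses only two of the sixteen possible words, gives a value close to 3 bits.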



4. Results 

We are interested in determining the following relationships: (1) the system
ratio $\rho$ versus the learner's model order $k$, (2) the estimated algorithmic complexity
$\ell_0$ of the subsequence $\xi_0$ versus $\rho$, and (3) the deviation $\Delta_0$ versus $\rho$.

We choose four different levels of learning problems, controlled by the order of
the source model $k^* = 3, 4, 5, 6$. For each problem we choose for the source model
a transition matrix of probabilities $p^*(1|i) = 1 - p$, $p^*(0|i) = p$, where for some
of the states $i$ we set $p = 0.3$ and for others $p = 0.7$, $1 \le i \le 2^{k^*}$. Thus the
Bayes optimal error is 0.3. To ensure that the problem is sufficiently challenging
we set the first half of the states (those ranging from the $k^*$-dimensional vector
$00 \ldots 0$ to $01 \ldots 1$) to have $p = 0.3$ and the second half ($10 \ldots 0$ to $11 \ldots 1$) to have
$p = 0.7$. This ensures that a Markov model of order $k < k^*$ cannot approximate
the true transition probabilities well, i.e., the infinite-sample limit estimate based
on a Markov model of order $k$ smaller than $k^*$ will still be $\hat{p}(1|i) = 0.5$,
$1 \le i \le 2^k$. But for a Markov model of order $k \ge k^*$ the infinite-sample
estimates will converge to the true values of $p$ or $1 - p$.
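The claim about the infinite-sample limit can be checked numerically. Under the construction above the next bit depends only on the oldest bit of the $k^*$-word, which a learner of order $k < k^*$ never sees. The sketch below builds the source table, computes the stationary distribution over the $2^{k^*}$ states by repeatedly applying the transition rule, and then evaluates the probability of a 1 given only the $k$ most recent bits. The names `source_table`, `stationary` and `limit_estimates` are hypothetical, and states are indexed from 0.

```python
def source_table(k_star, p_first=0.3, p_second=0.7):
    """p*(1|i) = 1 - p, with p = p_first on the first half of the states, p_second on the rest."""
    half = 1 << (k_star - 1)
    return [1 - p_first if i < half else 1 - p_second for i in range(1 << k_star)]

def stationary(p_one, order, iters=200):
    """Approximate the stationary distribution by repeatedly applying the transition rule."""
    n = 1 << order
    mask = n - 1
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [0.0] * n
        for s in range(n):
            nxt[((s << 1) | 1) & mask] += pi[s] * p_one[s]    # a 1 is emitted
            nxt[(s << 1) & mask] += pi[s] * (1 - p_one[s])    # a 0 is emitted
        pi = nxt
    return pi

def limit_estimates(p_one, k_star, k):
    """P(next bit = 1 | last k bits) under the stationary law, for k <= k_star."""
    pi = stationary(p_one, k_star)
    num = [0.0] * (1 << k)
    den = [0.0] * (1 << k)
    for s, w in enumerate(pi):
        suffix = s & ((1 << k) - 1)  # the k most recent bits of state s
        num[suffix] += w * p_one[s]
        den[suffix] += w
    return [a / b for a, b in zip(num, den)]

table = source_table(3)
print(limit_estimates(table, 3, 2))  # order 2 < k* = 3
print(limit_estimates(table, 3, 3))  # order 3 = k*
```

For $k^* = 3$ the order-2 values all come out at 0.5 (up to rounding), while the order-3 values reproduce the source table of 0.7 and 0.3.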

Before we start to investigate the three relationships stated above we perform
a sanity check to see how the prediction generalization error (for both
prediction types, not just when predicting a zero) varies with respect to the model
complexity $k$. Figure 4.1 displays this relationship for a learning problem with
$k^* = 3$. The curve (with x) is the mean error over all learning runs with a fixed $k$
value; the upper and lower curves are one standard deviation above and below the
mean, respectively. As seen, when the learner's model order $k$ is smaller than $k^*$ his
generalization error stays at the maximum level of 0.5. At $k = k^*$ there is a drop to
an error close to the Bayes error of 0.3. Then, as $k$ increases beyond $k^*$, the mean (as
well as the standard deviation) of the generalization error starts to increase. This is
due to overfitting of the model to the training data, and also because the variance
of the error estimate increases with $k$, since the maximum sample
size of any run is fixed at $m = 10000$ and does not increase with $k$.

We now proceed to describe the first result, which concerns the relationship
between the sysRatio $\rho$ and $k$. Figure 4.2 shows the mean and standard deviation of
the sysRatio $\rho$ as a function of $k$. The mean decreases as the learner's model order
$k$ increases. To explain this, first note that the uncompressed length of the system file
is always $c \cdot 2^k$ for some constant $c > 0$ since the vector $d$ is of length $2^k$ (see Section
3). The length of the compressed system file also grows, but at a slower rate with
respect to $k$, and this gives rise to the decrease in $\rho$ with respect to $k$. Why does the
length of the compressed system file grow more slowly?
Figure 4.1. Generalization error versus $k$ for $k^* = 3$
The reason is that for values of $k < k^*$ the learner's model is incapable (by
design of the learning problem) of estimating the Bayes optimal prediction and the
probability of the events "$b_i$ is followed by a 1" is $p(1|i) = 1/2$, $1 \le i \le 2^k$. Thus the
average value $\hat{p}(1|i)$ of the indicators of such events is a (scaled) Binomial random variable
with a distribution symmetric about 1/2, and hence from (3.1) the probability $\alpha_i$ that
$\hat{p}(1|i) > 1/2$ equals 1/2 (up to the probability of a tie at exactly 1/2). The components
of the random vector $d$ are independent
Bernoulli random variables with parameters $\alpha_i$ when conditioned on the sample
size vector $v$ (this is the vector whose components $v_i$ are the number of times
that $b_i$ appeared in the training sequence, see [7] for details). Since in this case
$\alpha_i = 1/2$, each component has the maximum entropy $H(d_i) = -\alpha_i \log \alpha_i - (1 - \alpha_i) \log(1 - \alpha_i) = \log 2 = 1$
(with logarithms to base 2), and hence the expected value of the entropy of the
vector $d$ (with respect to the random sample size vector $v$) is maximal and equals
$E_v H(d|v) = E_v \sum_{i=1}^{2^k} H(d_i|v_i) = E_v 2^k = 2^k$. Hence the expected compressed length
of the system file (which contains the vector $d$) is large, as the expected description
length of any random variable is at least as large as its entropy.

As $k$ increases beyond $k^*$ the model becomes more capable of estimating the
true transition probabilities (recall, these are either 0.3 or 0.7) and the probability
$p(1|i)$ of the events "$b_i$ is followed by a 1" gets farther away from 1/2 in the direction
of 0.3 or 0.7, depending on the particular state $i$, $1 \le i \le 2^k$. Thus the average
value $\hat{p}(1|i)$ of the indicators of such events is a (scaled) Binomial random variable with an
asymmetric distribution with mean $p(1|i)$. Hence from (3.1) the probability $\alpha_i$
that $\hat{p}(1|i) > 1/2$ gets very close to either 0 or 1 as the training size $m$ increases.
Thus the components of the random vector $d$ tend to be closer to deterministic.
They are still random since the training sequence length is not increasing with $k$
and the variance of the estimates $\hat{p}(1|i)$ does not converge to zero. Therefore for
each of the $2^k$ components of the vector $d$ the entropy is smaller than when $k < k^*$.
However, as there are exponentially many components $d_i$, on the whole the entropy
of $d$ (and hence the expected compressed length of the system file) still increases, but
at a lower rate than when $k < k^*$.
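The effect of the transition probability on the entropy of a single decision bit can be made concrete with a short computation. The sketch below, with hypothetical function names `alpha` and `binary_entropy`, evaluates the exact Binomial tail probability that the frequency estimate exceeds 1/2 after `v` visits to a state, and the resulting entropy in bits. An odd `v` is used so that ties at exactly 1/2 cannot occur.

```python
from math import comb, log2

def alpha(v, p):
    """P(Binomial(v, p) / v > 1/2): the chance that the decision at a state equals 1."""
    return sum(comb(v, j) * p ** j * (1 - p) ** (v - j) for j in range(v // 2 + 1, v + 1))

def binary_entropy(a):
    """Entropy in bits of a Bernoulli random variable with parameter a."""
    if a <= 0.0 or a >= 1.0:
        return 0.0
    return -a * log2(a) - (1 - a) * log2(1 - a)

for v in (11, 101):
    for p in (0.5, 0.7):
        a = alpha(v, p)
        print(v, p, a, binary_entropy(a))
```

With $p = 0.5$ the entropy stays at 1 bit for any odd number of visits, while with $p = 0.7$ it drops below 1 bit and shrinks towards 0 as the number of visits grows. This matches the argument that each component of $d$ carries less entropy once $k$ reaches $k^*$.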

Next, we discuss the characteristics of the mistake subsequence $\xi_0$. Figure 4.3
shows the graph (with x) of the mean of the estimated algorithmic complexity $\ell_0$ of
$\xi_0$ versus the mean of the system ratio $\rho$ on the horizontal axis. The dashed lines
are the upper and lower envelopes of the standard deviation from the mean. The
arrow points at the value of $\rho^*$ that corresponds to $k^*$ (the source model order).
Figure 4.3. Estimated algorithmic complexity $\ell_0$ of the mistake
subsequence $\xi_0$ versus the sysRatio $\rho$



As can be seen, for low values of the sysRatio the spread of $\ell_0$ is low. There is a sharp
threshold at $\rho^*$ where the spread around the mean value of $\ell_0$ increases significantly.

Next, Figure 4.4 displays the graph (with x) of the mean of the divergence $\Delta_0$ of
the mistake subsequence $\xi_0$ versus the mean of the system ratio $\rho$ on the horizontal
axis. The dashed lines are the upper and lower envelopes of the standard deviation
from the mean. The arrow points at the value of $\rho^*$ that corresponds to $k^*$ (the
source model order). As can be seen, for low values of the sysRatio the spread of $\Delta_0$
is low. As with the result above for $\ell_0$, we see a threshold at $\rho^*$ where the standard
deviation around the mean value of $\Delta_0$ increases significantly.
Figure 4.4. Divergence $\Delta_0$ of the mistake subsequence $\xi_0$ versus
the sysRatio $\rho$



5. Conclusions 

The paper introduces the notion of sysRatio $\rho$, which is a measure of the information
density of the learner's model. It is similar to the notion of rate of information
transmission [2], as it measures the number of useful information bits
contained in a file that describes the learner's decision rule per bit of representation
(in the file). The results of this paper indicate that this information density
influences the level of randomness of the mistakes made by a learner. The sysRatio $\rho$
is a proper measure of the complexity of a learner's decision rule. It is with respect to
$\rho$ that the characteristics of the random mistake subsequence $\xi_0$ follow what the
theory [7] predicts. The higher the sysRatio, the more significant the deviation $\Delta_0$
of $\xi_0$ from a pure Bernoulli random sequence. In addition, we have shown
that the higher the sysRatio, the larger the possible fluctuations in the algorithmic
complexity $\ell_0$ of $\xi_0$. The interesting point is the sharp non-linearity in this
relationship. We showed that there is a threshold $\rho^*$ at which the spread in values
of $\ell_0$ and $\Delta_0$ increases, and it corresponds to the point where the learner's model
becomes too simple and is incapable of predicting well.